Our project

We decided to do central Colombia, basically because it is where the capital is.

We built a model for the number of confirmed cases using all the others covariates (plus some we created) and we estimated the predictive accuracy of our selected model.

We decided to consider as central Colombia the following departments/districts: Bogotà DC, Boyacá, Tolima, Cundinamarca, Meta, Quindío, Valle del Cauca, Risaralda, Celdas, Boyacá, Antioquia, Santander, Casanare.

Description of variables

  • ID de caso: ID of the confirmed case.

  • Fecha de diagnóstico: Date in which the disease was diagnosed.

  • Ciudad de ubicación: City where the case was diagnosed.

  • Departamento o Distrito: Department or district where the city belongs to.

  • Atención: Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.

  • Edad: Age of the confirmed case.

  • Sexo: Sex of the confirmed case.

  • Tipo: How the person got infected: in Colombia, abroad or unknown.

  • País de procedencia: Country of origin if the person got infected abroad.

Map

Here we can see our selected cities. The color of the pins is related with the number of cases: if they are less than \(10\) the color is “green”, if they are less than \(100\) the color is “orange”, otherwise it is “red”.

Preprocessing

We had to clean the dataset:

  • We transformed the Fecha de diagnóstico variable into a Date type variable,

  • we fixed the variable Id de caso (since we removed some departments, so some lines, the numbers weren’t consecutive),

  • we created a variable Grupo de edad,

  • we cleaned the column País de procedencia (replaced cities with the country) and created the variable Continente de procedencia (as the first is too fragmented we thought to consider the continents).

##   ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1          1           2020-03-06         Bogotá D.C.             Bogotá D.C.
## 2          2           2020-03-09            Medellín               Antioquia
## 3          3           2020-03-11            Medellín               Antioquia
##     Atención Edad Sexo        Tipo País de procedencia Grupo de edad
## 1 Recuperado   19    F   Importado              Italia         19_30
## 2 Recuperado   50    F   Importado              España         46_60
## 3 Recuperado   55    M Relacionado                 Nan         46_60

New dataset I

##          Date Elapsed time New cases/day Cumulative cases
## 1  2020-03-06            0             1                1
## 2  2020-03-09            3             1                2
## 3  2020-03-11            5             5                7
## 4  2020-03-12            6             2                9
## 5  2020-03-13            7             3               12
## 6  2020-03-14            8            14               26
## 7  2020-03-15            9            13               39
## 8  2020-03-16           10             8               47
## 9  2020-03-17           11            13               60
## 10 2020-03-18           12             9               69

New dataset II

##          Date Elapsed time Department Department ID New cases/day
## 1  2020-03-09            3  Antioquia             1             1
## 2  2020-03-11            5  Antioquia             1             3
## 3  2020-03-14            8  Antioquia             1             3
## 4  2020-03-15            9  Antioquia             1             1
## 5  2020-03-19           13  Antioquia             1             3
## 6  2020-03-20           14  Antioquia             1            11
## 7  2020-03-21           15  Antioquia             1             3
## 8  2020-03-22           16  Antioquia             1             5
## 9  2020-03-23           17  Antioquia             1            22
## 10 2020-03-25           19  Antioquia             1             8
##    Cumulative cases/Department Mean age
## 1                            1 50.00000
## 2                            4 35.66667
## 3                            7 30.00000
## 4                            8 55.00000
## 5                           11 52.33333
## 6                           22 39.81818
## 7                           25 31.00000
## 8                           30 45.40000
## 9                           52 36.45455
## 10                          60 29.75000

Exploring the dataset

Scattered infos about pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):

  • the quarantine started on the 20th of March, since our data are from 6th of March to 2nd of April, it is very likeliy that quarantine effects are not witnessed in our data.

  • on March the 26th there was a damage in the machine that prepared the samples for processing and subsequent diagnosis of COVID-19, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases.


The previous plot represents the daily incidence of the desease across all the departments we are taking into account.

Other plots


The major number of cases are in the capital Bogotà.


Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).


The desease (number of cases) is more or less equally distributed across genders.


People from 31 to 45 years old are the most affected by the disease and people over 76 years old are the least affected. Colombia is a very young country. In 2018 the median age of the population was 30.4 years old and the largest age group is made of people from 25 to 54 years old, which comprises 41.98% of the population. (https://www.indexmundi.com/colombia/demographics_profile.html)

Age-Sex plot


There isn’t much difference between the sexes among the different group of ages.

Tipo

I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:

## # A tibble: 3 x 3
##   Tipo        `Total number` Percentage
##   <chr>                <int> <chr>     
## 1 Relacionado           8570 12%       
## 2 Importado              687 1%        
## 3 En Estudio           60406 87%

The majority of the cases in the country are people that got infected inside Colombia. Then, people that contracted the disease abroad came mainly from Europe, followed by North America and Central America.

The frequentist approach

Train/test split

We splitted the data so to leave out the last three points for prediction, because we have few points and because in this models it has no sense to predict a week, because the situation changes really fast.

Using cases_relev_dep

## 
## Call:
## glm(formula = `Cumulative cases/Department` ~ `Elapsed time`, 
##     family = poisson, data = cases_relev_dep)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -101.60   -47.19   -18.96    -8.85   355.34  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    4.185e+00  3.698e-03    1132   <2e-16 ***
## `Elapsed time` 3.492e-02  3.455e-05    1011   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5910918  on 1050  degrees of freedom
## Residual deviance: 4404273  on 1049  degrees of freedom
## AIC: 4411978
## 
## Number of Fisher Scoring iterations: 6
## [1] "MSE: 3.77645088133327"
## [1] "AIC: 4411977.61248341"

We can see that the AIC is enormous.

Poisson with Elapsed time plus Elapsed time^2 as predictor

## [1] "Estimated overdispersion 74.6267856024938"
## [1] "RMSE: 6940.37335779742"
## [1] "AIC: 12062.7254007421"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 10862.08"

Poisson with Elapsed time, Age and Departments as predictors


## [1] "Estimated overdispersion 9305019.32906713"
## [1] "Real:  52763 Predict:  51352.1683340905"
## [2] "Real:  55062 Predict:  51926.3623882142"
## [3] "Real:  57218 Predict:  51407.7654815247"
## [4] "Real:  59929 Predict:  46143.4658501589"
## [5] "Real:  63731 Predict:  51890.9303272534"
## [6] "Real:  67003 Predict:  57193.1704598653"
## [7] "Real:  69663 Predict:  71578.2029401362"
## [1] "RMSE: 8243.70282563776"
## [1] "AIC: 18085.2709276108"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 16854.63"

ANOVA to compare the Poisson models

## Analysis of Deviance Table
## 
## Model 1: `Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2)
## Model 2: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+`
## Model 3: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       117      10862                          
## 2       113      19441  4  -8579.3              
## 3       102      16855 11   2586.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

##            fit        lwr        upr
## 1     330.0746   326.5389   333.6486
## 2     375.9108   372.0358   379.8261
## 3     408.9914   404.8613   413.1637
## 4     429.0613   424.8173   433.3478
## 5     449.1625   444.8197   453.5477
## 6     465.5929   461.0866   470.1433
## 7     488.4038   483.7547   493.0975
## 8     510.6072   505.8188   515.4409
## 9     533.5229   528.5940   538.4977
## 10    552.9769   547.9174   558.0831
## 11    581.7205   576.5240   586.9639
## 12    602.9698   597.5202   608.4691
## 13    631.7498   626.2286   637.3196
## 14    664.0337   658.2573   669.8607
## 15    680.7526   674.7469   686.8116
## 16    721.6001   715.5526   727.6987
## 17    757.8006   751.1651   764.4947
## 18    785.9674   779.5862   792.4009
## 19    823.5685   816.8557   830.3365
## 20    858.9395   852.1669   865.7658
## 21    876.7155   869.4557   884.0359
## 22    930.6729   923.4462   937.9562
## 23    971.7989   964.3235   979.3322
## 24   1032.8012  1022.9102  1042.7877
## 25   1054.1709  1046.2887  1062.1126
## 26   1093.3457  1085.0495  1101.7053
## 27   1146.1746  1137.1944  1155.2258
## 28   1202.9545  1194.1638  1211.8099
## 29   1258.7105  1249.6868  1267.7994
## 30   1311.3175  1302.1746  1320.5246
## 31   1357.8835  1337.6157  1378.4585
## 32   1444.9156  1434.7098  1455.1941
## 33   1519.8684  1509.1450  1530.6679
## 34   1587.8759  1575.2161  1600.6374
## 35   1630.2011  1619.8580  1640.6103
## 36   1722.7991  1712.2271  1733.4363
## 37   1824.4607  1812.4555  1836.5453
## 38   1850.1033  1837.2203  1863.0766
## 39   1989.4668  1978.1516  2000.8467
## 40   1922.1853  1901.6820  1942.9096
## 41   2115.2571  2102.3406  2128.2530
## 42   2234.8920  2221.9995  2247.8592
## 43   2331.9074  2319.7325  2344.1463
## 44   2431.4378  2418.1501  2444.7984
## 45   2634.0492  2617.5718  2650.6302
## 46   2744.4903  2729.0463  2760.0217
## 47   2873.0845  2858.0022  2888.2463
## 48   2863.4143  2838.8978  2888.1425
## 49   3176.8999  3158.6408  3195.2647
## 50   3196.9225  3179.5268  3214.4134
## 51   3893.4344  3848.6899  3938.6991
## 52   3462.5208  3440.7520  3484.4273
## 53   3635.1333  3611.3954  3659.0273
## 54   4217.3130  4170.8403  4264.3034
## 55   3936.4731  3910.4026  3962.7174
## 56   4618.8287  4576.6449  4661.4013
## 57   4231.7340  4208.6675  4254.9269
## 58   5587.4719  5496.0193  5680.4463
## 59   4879.3421  4850.5114  4908.3440
## 60   4960.9549  4940.6730  4981.3201
## 61   5581.3981  5534.3548  5628.8414
## 62   5861.2124  5820.2560  5902.4569
## 63   5574.8777  5548.6184  5601.2613
## 64   5815.6063  5786.7861  5844.5700
## 65   5987.4306  5963.8100  6011.1446
## 66   6350.1873  6307.2567  6393.4101
## 67   6620.2329  6590.0085  6650.5959
## 68   6893.5614  6852.1030  6935.2707
## 69   7214.9504  7151.4212  7279.0440
## 70   7620.4264  7584.1069  7656.9199
## 71   7782.2056  7754.0042  7810.5094
## 72   8061.2257  8006.4048  8116.4220
## 73   8720.0235  8681.6719  8758.5445
## 74   8803.8471  8761.8373  8846.0582
## 75   9212.6083  9170.4593  9254.9509
## 76   9199.1815  9162.7615  9235.7462
## 77   9898.5171  9862.4445  9934.7216
## 78   9555.7379  9468.4566  9643.8238
## 79  10760.8789 10697.6546 10824.4769
## 80  11407.1952 11292.1149 11523.4482
## 81  11691.4576 11634.5050 11748.6889
## 82  12514.5215 12422.7444 12606.9767
## 83  12664.4871 12587.9120 12741.5281
## 84  12963.7257 12861.3620 13066.9042
## 85  13857.2611 13788.4567 13926.4089
## 86  15154.9754 15074.7667 15235.6108
## 87  15372.5241 15258.6298 15487.2686
## 88  15791.9001 15735.9156 15848.0838
## 89  17040.9553 16938.5110 17144.0193
## 90  17408.0264 17282.3735 17534.5929
## 91  17992.8830 17904.5528 18081.6489
## 92  18912.1907 18757.7983 19067.8539
## 93  19746.3114 19658.7003 19834.3129
## 94  20229.3760 20142.0126 20317.1184
## 95  21782.0187 21651.6274 21913.1952
## 96  21889.2743 21779.7893 21999.3096
## 97  21468.2660 21382.5634 21554.3121
## 98  22265.3032 22061.1384 22471.3574
## 99  24673.4640 24549.2250 24798.3317
## 100 25209.7074 25084.5878 25335.4510
## 101 25271.9867 25123.4190 25421.4330
## 102 26106.3282 25843.9985 26371.3207
## 103 28544.5258 28313.5482 28777.3876
## 104 29689.4939 29567.0004 29812.4948
## 105 29272.5660 28999.4995 29548.2037
## 106 31186.0038 30939.5415 31434.4293
## 107 35656.0034 35413.4774 35900.1903
## 108 35661.4994 35411.5718 35913.1910
## 109 34539.8290 34282.2070 34799.3868
## 110 36406.1526 36196.3948 36617.1260
## 111 37543.2985 37243.1994 37845.8157
## 112 40120.3228 39846.2368 40396.2941
## 113 41198.9581 40877.9624 41522.4745
## 114 46613.7677 46353.8635 46875.1292
## 115 40796.5908 40468.9743 41126.8596
## 116 45143.2410 44771.5259 45518.0422
## 117 46953.8524 46568.4419 47342.4527
## 118 47823.5956 47454.0505 48196.0185
## 119 51599.4931 51211.6249 51990.2989
## 120 51352.1683 50993.4456 51713.4146

Predictive accuracy of the Poisson model for New cases/day

Predicting with a \(95\%\) confidence interval


Negative Binomial with Elapsed time, Age and Departments as pedictors


## [1] "Estimated overdispersion 1.62538490690911"
## [1] "RMSE: 1379.56664915407"
## [1] "AIC: 1435.14403991019"
## [1] "Null deviance:  1416.29"   "Residual deviance: 146.46"

Applying ANOVA to compare the negative binomial models

We decided to compare nb1, nb2, nb5, because they are nested and we are more interested in seeing if the fifth model is in fact better than the first model.

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: Cumulative cases
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Model
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Elapsed time
## 2                                                                                                                                                                                                                                                                                                                                                                                                                         `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n    `Departamento o Distrito_Tolima`
##      theta Resid. df    2 x log-lik.   Test    df LR stat. Pr(Chi)
## 1 11252780       118       -21905.18                              
## 2 12919543       113       -20629.27 1 vs 2     5 1275.907       0
## 3 13728063       102       -18042.49 2 vs 3    11 2586.780       0

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval


Predictive accuracy of the NB model for New cases/day

Predicting with a \(95\%\) confidence interval


The Bayesian approach

Poisson regression

As a first attempt, we fit a simple Poisson regression:

\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]

with \(i = 1,\dots,134\), being \(134\) the number of rows of our dataset, and \(y_i\) represents the number of cases.

For what concerns the stan program, we used the function poisson_log_rng to describe the distribution of \(y_i\), namely the number of cases each day and the function poisson_log_lpmf to specify the likelihood.

Posterior predictive check

poisson_posterior-1_new

The fit is not satisfactory, it is probably due to overdispersion, we can check the residuals to confirm this hypothesis.

Residual check

first_residual-1_new

The variance of the residuals increases as the predicted value increase. The standardized residuals should have mean 0 and standard deviation 1 (hence the lines at \(+2\) and \(-2\) indicates approximate \(95\%\) error bounds).

The plot of the standardized residuals indicates a large amount of overdispersion.

Classically the problem of having overdispersed data is solved using the negative binomial model instead of the Poisson’s one.

Negative Binomial model

We try to improve the previous model using the Negative Binomial model:

\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]

Where the parameter \(\phi\) is called precision and it is such that:

\[ E[y_i] = \lambda_i \\ Var[y_i] = \lambda_i + \frac{\lambda_i^2}{\phi} \]

again \(i=1,\dots,134\). As \(\phi \rightarrow \infty\) the negative binomial approaches the Poisson distribution.

The stan function that we use here are neg_binomial_2_log_rng to specify the distribution of \(y_i\) and the function neg_binomial_2_log_lpmf for the likelihood.

Accuracy across departments

NB_deps_new

We should take into account the differences across departments.

Multilevel Negative Binomial regression

We try to fit the following model, which also includes Age as covariat:

\[ ln(\lambda_i) = \alpha + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \]

Hierarchical model

In order to improve the fit, we fit a model with department-specific intercept term.

So the varying intercept model that we take into account is now:

\[ ln(\lambda_{i,d}) = \alpha_d + + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i\\ \alpha_d \sim \mathcal{N}(\mu + \beta_{pop}\cdot pop_d + \beta_{sur}\cdot surface_d + \beta_{dens} \cdot density_d, \sigma_{\alpha})\\ y_i \sim \mathcal{Negative Binomial}(\lambda_{i,d}, \phi) \]

The priors used for the above model are the following:

\[ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \\ \psi \sim \mathcal{N}(0,1) \]

being \(\psi = [\beta_{pop}, \beta_{sur}, \beta_{dens}]\).

New dataset

We added the following covariats into the dataset:

  • People: millions of inhabitants for each region;

  • Surface: \(km^3\), extent of each region;

  • Density: \(\frac{people}{km^2}\), density of the population in each region.

The model is:

Accuracy across departments

hier_deps_new

We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.

LOOIC

The Leave-One-Out cross validation is a method for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulation of the parameters values.

Plot the looic to compare models:

looic_new